TaxE: a Testbed for Hierarchical Document Classifiers
نویسندگان
چکیده
In the last decade the interest in the hierarchical organization of documents is increased. New challenges arise as hierarchical document classification, both unsupervised and supervised. A recognition of the most recent literature on these topics shows that none of the published works refer to the same dataset to enable the experimental phase. Moreover the papers don’t provide enough details to reproduce the same datasets starting from the same information sources. The drawback is twofold: from one hand the waste of time to preprocess suitable datasets, to the other hand the lack of a common testbed to compare alternative solutions. In this paper we propose a dataset extracted from Google and LookSmart web directories to support the experimentation effort in the field of hierarchical document classification. For such a task we aim to provide a kind of reference corpus in analogy with the role that Reuters plays in the scientific community. The paper illustrates the process performed to generate a well defined dataset. This dataset is freely distributed over the web.
منابع مشابه
Learning Document Image Features With SqueezeNet Convolutional Neural Network
The classification of various document images is considered an important step towards building a modern digital library or office automation system. Convolutional Neural Network (CNN) classifiers trained with backpropagation are considered to be the current state of the art model for this task. However, there are two major drawbacks for these classifiers: the huge computational power demand for...
متن کاملA Document Weighted Approach for Gender and Age Prediction Based on Term Weight Measure
Author profiling is a text classification technique, which is used to predict the profiles of unknown text by analyzing their writing styles. Author profiles are the characteristics of the authors like gender, age, nativity language, country and educational background. The existing approaches for Author Profiling suffered from problems like high dimensionality of features and fail to capture th...
متن کاملPerformance measurement framework for hierarchical text classification
Hierarchical text classification or simply hierarchical classification refers to assigning a document to one or more suitable categories from a hierarchical category space. In our literature survey, we have found that the existing hierarchical classification experiments used a variety of measures to evaluate performance. These performance measures often assume independence between categories an...
متن کاملApplications of Machine Learning to Information Access
The recent explosion of on-line information has given rise to a number of query-based search engines (e.g., Alta Vista) and manually constructed topic hierarchies (e.g., Yuhoo!). But with the current rate of growth in the amount of available information, query results grow incomprehensibly large and manual classification in topic hierarchies creates an immense information bottleneck. Therefore,...
متن کاملSome Issues in the Automatic Classification of U.S. Patents
The classification of U.S. patents poses some special problems due to the enormous size of the corpus, the size and complex hierarchical structure of the classification system, and the size and structure of patent documents. The representation of the complex structure of documents has not received a great deal of previous attention, but we have found it to be an important factor in our work. We...
متن کامل